Making Logistic Regression A Core Data Mining Tool
Abstract
Binary classification is a core data mining task. For large datasets or real-time applications, desirable classifiers are accurate, fast, and automatic (i.e., they require no parameter tuning). Naive Bayes and decision trees are fast and parameter-free, but their accuracy is often below the state of the art. Linear support vector machines (SVMs) are fast and have good accuracy, but current implementations are sensitive to the capacity parameter. SVMs with radial basis function (RBF) kernels are accurate but slow, and have multiple parameters that require tuning. In this paper we demonstrate that a very simple parameter-free implementation of logistic regression (LR) is sufficiently accurate and fast to compete with state-of-the-art binary classifiers on large real-world datasets. Its accuracy is comparable to that of per-dataset tuned linear SVMs and, in higher dimensions, to that of tuned RBF SVMs. A combination of regularization, truncated-Newton methods, and iteratively re-weighted least squares (IRLS) makes this implementation faster than SVMs and relatively insensitive to parameters. Our fitting procedure, TR-IRLS, appears to outperform several common LR fitting procedures in our experiments. TR-IRLS is robust to linear dependencies and scaling problems in the data, and no data preprocessing is necessary. TR-IRLS is easy to implement and can be used anywhere that IRLS is used. Convergence guarantees can be stated for generalized linear models with canonical links.
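The three ingredients named in the abstract (regularization, a truncated Newton step, and IRLS) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the ridge value, the iteration caps, and the stopping tolerances are all assumptions chosen for illustration. Each IRLS iteration forms a ridge-regularized weighted least-squares system, and that system is solved only approximately with a capped number of conjugate-gradient steps.

```python
import numpy as np

def _conjugate_gradient(matvec, b, x0, max_iter, tol):
    """Approximately solve A x = b for symmetric positive-definite A (given as matvec)."""
    x = x0.copy()
    r = b - matvec(x)
    d = r.copy()
    rs = r @ r
    b_norm = max(np.linalg.norm(b), 1.0)
    for _ in range(max_iter):  # the cap is what "truncates" the Newton step
        if np.sqrt(rs) <= tol * b_norm:
            break
        Ad = matvec(d)
        alpha = rs / (d @ Ad)
        x += alpha * d
        r -= alpha * Ad
        rs_new = r @ r
        d = r + (rs_new / rs) * d
        rs = rs_new
    return x

def fit_tr_irls_sketch(X, y, ridge=1.0, max_irls=30, cg_iter=200,
                       cg_tol=1e-6, dev_tol=1e-4):
    """Ridge-regularized logistic regression via IRLS with an inexact CG solve.

    All defaults are illustrative assumptions, not values from the paper.
    For brevity the intercept is penalized along with the other weights.
    """
    n = X.shape[0]
    Xb = np.hstack([np.ones((n, 1)), X])  # prepend an intercept column
    beta = np.zeros(Xb.shape[1])
    prev_dev = np.inf
    for _ in range(max_irls):
        eta = Xb @ beta
        mu = 1.0 / (1.0 + np.exp(-np.clip(eta, -30.0, 30.0)))
        w = mu * (1.0 - mu)  # IRLS weights
        # Right-hand side X^T W z with z = eta + (y - mu) / w, written without the division.
        rhs = Xb.T @ (w * eta + (y - mu))

        def matvec(v, w=w):
            # Applies (X^T W X + ridge * I) to v without forming the matrix.
            return Xb.T @ (w * (Xb @ v)) + ridge * v

        beta = _conjugate_gradient(matvec, rhs, beta, cg_iter, cg_tol)

        eta = Xb @ beta
        dev = -2.0 * np.sum(y * eta - np.logaddexp(0.0, eta))  # binomial deviance
        if abs(prev_dev - dev) / max(abs(dev), 1e-12) < dev_tol:
            break
        prev_dev = dev
    return beta

def predict_proba(beta, X):
    """Predicted probability of class 1 for each row of X."""
    eta = beta[0] + X @ beta[1:]
    return 1.0 / (1.0 + np.exp(-np.clip(eta, -30.0, 30.0)))
```

Calling `fit_tr_irls_sketch(X, y)` with a NumPy feature matrix `X` and a 0/1 label vector `y` returns the intercept followed by the weights. Because the matrix is only ever applied through `matvec`, the same structure would carry over to sparse inputs, and the ridge term keeps the system positive definite even when columns are linearly dependent.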
Similar resources
A Study to Improve the Response in Email Campaigning by Comparing Data Mining Segmentation Approaches in Aditi Technologies
Email marketing is increasingly recognized as an effective Internet marketing tool. In this study, a questionnaire is constructed and distributed to a sample of 146 prospects of Aditi Technologies to find the factors associated with higher response rates. The collected data is analyzed using Factor Analysis and the 11 factors, From Line, Subject Line, Personalization of the subject line, Timing...
Data Mining for Decision Making in Direct Marketing: a Bayesian Networks Approach with Evolutionary Programming
Given the explosive growth of customer and transactional information, data mining can potentially discover new knowledge to improve managerial decision making in marketing. This study proposes an innovative approach to data mining using Bayesian Networks and evolutionary programming and applies the methods to direct marketing data. The results suggest that this approach to knowledge discovery c...
Prediction of the main caving span in longwall mining using fuzzy MCDM technique and statistical method
Immediate roof caving in longwall mining is a complex dynamic process, and it is the core of numerous issues and challenges in this method. Hence, a reliable prediction of the strata behavior and its caving potential is imperative in the planning stage of a longwall project. The span of the main caving is the quantitative criterion that represents cavability. In this paper, two approaches are p...
Calculating classifier calibration performance with a custom modification of Weka
Calibration is often overlooked in machine-learning problem-solving approaches, even in situations where an accurate estimation of predicted probabilities, and not only a discrimination between classes, is critical for decision-making. One of the reasons is the lack of readily available open-source software packages which can easily calculate calibration metrics. In order to provide one such to...
Principal Component Analysis as an Integral Part of Data Mining in Health Informatics
Linear and logistic regression are well-known data mining techniques; however, their ability to deal with interdependent variables is limited. Principal component analysis (PCA) is a prevalent data reduction tool that both transforms the data orthogonally and reduces its dimensionality. In this paper we explore an adaptive hybrid approach where PCA can be used in conjunction with logistic regre...